Markovian domain fingerprinting in statistical segmentation of protein sequences

ABSTRACT

Apparatus for automatic segmentation of non-aligned data sequences comprising structural domains to identify and construct models of the structural domains. The apparatus comprises a soft clustering unit, a refinement unit and an annealing unit. The soft clustering unit iteratively partitions the data sequences and trains variable memory Markov sources, created using a prediction suffix tree data structure, on the data until convergence is reached. The clustering unit also eliminates sources showing low relationships with the data. The refinement unit is connected to the soft clustering unit and splits and perturbs the sources following convergence, to repeat the iterative partitioning at the soft clustering unit, thereby to refine the model. The annealing unit increases the resolution with which the relationships between data and sources are shown, thereby governing the way in which less competitive sources are rejected, and the apparatus outputs the surviving variable memory Markov sources to provide models for subsequent identification of the structural domains.

FIELD OF THE INVENTION

[0001] The present invention relates to Markovian domain fingerprinting and more particularly but not exclusively to use of the same in statistical segmentation of protein sequences.

BACKGROUND OF THE INVENTION

[0002] Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods find difficulties when faced with heterogeneous groups of proteins. However, even many families of proteins that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can analyze a group of proteins into the sequence domains it contains is therefore highly desirable.

[0003] Numerous proteins exhibit a modular architecture, consisting of several sequence domains that often carry specific biological functions. The subject is reviewed in Bork, P. (1992) Mobile modules and motifs. Curr. Opin. Struct. Biol., 2, 413-421, and also in Bork, P. and Koonin, E. (1996) Protein sequence motifs. Curr. Opin. Struct. Biol., 6, 366-376, the contents of both of these citations hereby being incorporated by reference.

[0004] For proteins whose structure has been solved, it can be shown in many cases that the characterized sequence domains are associated with autonomous structural domains (e.g. the C2H2 zinc finger domain). Characterization of a protein family by its distinct sequence domains (also referred to herein as modules) either directly or through the use of domain motifs, or signatures, is crucial for functional annotation and correct classification of newly discovered proteins. In many cases the underlying genes may have undergone shuffling events that have led to a change in the order of modules in related proteins. In other cases a certain module may appear in many proteins, adjacent to different modules. A global alignment that ignores the modular organization of proteins may fail to associate a protein with other proteins that carry a similar functional module but in a different relative sequence location. Also, ignoring the modularity of proteins may lead to clustering of non-related proteins through false transitive associations. Thus, ideally, clustering of proteins into distinct families may be based on characterization of a common sequence domain or a common signature and not on the entire sequence, thus allowing a single sequence to be clustered into several groups. In order to achieve such clustering, an unsupervised method for identification of the domains that compose a protein sequence is essential. Many methods have been proposed for classification of proteins based on their sequence characteristics. Most of them are based on a seed Multiple Sequence Alignment (MSA) of proteins that are known to be related. The MSA can then be used to characterize the family in various ways, and examples are given in the following list:

[0005] 1. by defining characteristic motifs of the functional sites (as in Hofmann, K., Bucher, P., Falquet, L. and Bairoch, A. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res., 27, 215-219),

[0006] 2. by providing a fingerprint that may consist of several motifs (Attwood, T., Croning, M., Flower, D., Lewis, A., Mabey, J., Scordis, P., Selley, J. and Wright, W. (2000) PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res., 28, 225-227),

[0007] 3. by describing a multiple alignment of a domain using a Hidden Markov Model (HMM) (Bateman, A., Birney, E., Durbin, R., Eddy, S., Howe, K. and Sonnhammer, E. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263-266), or

[0008] 4. by a position specific scoring matrix (Henikoff, J. G., Greene, E. A., Pietrokovski, S. and Henikoff, S. (2000) Increased coverage of protein families with the Blocks database servers. Nucleic Acids Res., 28, 228-230).

[0009] All the above techniques, however, rely strongly on the initial selection of the related protein segments for the MSA, and the selection is generally case specific and requires expert input. The techniques also rely heavily on the quality of the MSA itself. The calculation is in general computationally intractable, and when remote sequences are included in a group of related proteins, establishment of a good MSA ceases to be an easy task and delineation of the domain boundaries proves even harder. Establishment of an MSA becomes nearly impossible for heterogeneous groups where the shared motifs are not necessarily abundant, nor in linear ordering. It is therefore highly desirable to complement these methods with efficient automatic generation of sequence signatures which can guide the classification and further analysis of the sequences. This need is especially emphasized in view of current large-scale sequencing projects, generating a vast amount of sequences that require annotation. Unsupervised segmentation of sequences, on the other hand, has become a fundamental problem with many important applications such as analysis of texts, handwriting and speech, neural spike trains and indeed bio-molecular sequences. The most common statistical approach to this problem is currently the HMM. HMMs are predefined parametric models and their success crucially depends on the correct choice of the state model. In the common application of HMMs, the architecture and topology of the model are predetermined and the memory is limited to first order. It is rather difficult to generalize these models to hierarchical structures with unknown a-priori state-topology (for an attempt see Fine, S., Singer, Y. and Tishby, N. (1998) The hierarchical hidden Markov model: analysis and applications. Mach. Learn., 32, 41-62). An interesting alternative to the HMM was proposed in Ron, D., Singer, Y. and Tishby, N. (1996) The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn., 25, 117-149, the contents of which are hereby incorporated by reference. The citation teaches a sub-class of probabilistic finite automata, the Variable Memory Markov (VMM) sources. While these models can be weaker as generative models, they have several important advantages:

[0010] (i) they capture longer correlations and higher order statistics of the sequence;

[0011] (ii) they can learn in a provably optimal sense using a construction called Prediction Suffix Tree (PST) (Ron et al., 1996; Buhlmann, P. and Wyner, A. (1999) Variable length Markov chains. Ann. Stat., 27, 480-513);

[0012] (iii) they can learn very efficiently by linear time algorithms (Apostolico, A. and Bejerano, G. (2000) Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space. J. Comput. Biol., 7, 381-393);

[0013] (iv) their topology and complexity are determined by the data; and, specifically in our context

[0014] (v) their ability to model protein families has been demonstrated (Bejerano, G. and Yona, G. (2001) Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 23-43).

SUMMARY OF THE INVENTION

[0015] According to a first aspect of the present invention there is thus provided apparatus for automatic segmentation of non-aligned data sequences comprising structural domains to identify the structural domains and construct models thereof, the apparatus comprising:

[0016] a soft clustering unit for:

[0017] iteratively partitioning the data sequences and training a plurality of variable memory Markov sources thereon to reach a state of convergence, and

[0018] eliminating ones of the variable memory Markov sources showing low relationships with the data,

[0019] a refinement unit associated with the soft clustering unit for splitting and perturbing the sources, following convergence, for further iterative partitioning and eliminating at the soft clustering unit, and

[0020] an annealing unit, associated with the soft clustering unit, for successively increasing a resolution with which the relationships between data and sources are shown, thereby to render the eliminating a progressive process,

[0021] the apparatus being operable to output remaining variable memory Markov sources to provide models for subsequent identification of the structural domains.

[0022] Preferably, the sequences are biological sequences.

[0023] Preferably, the sequences are protein sequences.

[0024] Preferably, the structural domains are functional protein units.

[0025] Preferably, the sources comprise prediction suffix trees.

[0026] Preferably, the structural domains are from domain families being any one of a group comprising Pax proteins, type II DNA Topoisomerases, and glutathione S-transferases.

[0027] According to a second aspect of the present invention there is provided a method for automatic segmentation of non-aligned data sequences comprising structural domains to identify the structural domains and construct models thereof, the method comprising:

[0028] iteratively partitioning the data sequences and training a plurality of variable memory Markov sources thereon to reach a state of convergence, and

[0029] eliminating ones of the variable memory Markov sources showing low relationships with the data,

[0030] splitting and perturbing the sources, following convergence, for further iterative partitioning and eliminating, and

[0031] successively increasing a resolution with which the relationships between data and sources are shown, thereby to render the further eliminating a progressive process, and

[0032] outputting remaining variable memory Markov sources to provide models for subsequent identification of the structural domains.

BRIEF DESCRIPTION OF THE DRAWINGS

[0033] For a better understanding of the invention and to show how the same may be carried into effect, reference will now be made, purely by way of example, to the accompanying drawings.

[0034] With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice. In the accompanying drawings:

[0035] FIG. 1 is a simplified diagram of a domain fingerprinting apparatus in accordance with a first embodiment of the present invention,

[0036] FIG. 2 is an example of a PST over the alphabet Σ={a,b,c,d,r},

[0037] FIG. 3 is a chart showing a segmentation algorithm according to an embodiment of the present invention,

[0038] FIG. 4 is a schematic description of the algorithm of FIG. 3,

[0039] FIGS. 5, 6, 7 and 8 are graphs showing results signatures,

[0040] FIG. 9 is a simplified diagram illustrating a protein fusion event, and

[0041] FIG. 10 is a graph showing comparative results obtained using the prior art.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0042] The present embodiments disclose a novel method, and corresponding apparatus, for the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. In examples using the above method, it is shown, by matching a unique signature to each domain, that regions having similar statistics correlate well with protein sequence domains. The method may be carried out in a fully automated manner, and does not require or attempt an MSA, thereby avoiding the need for expert input.

[0043] Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.

[0044] FIG. 1 is a simplified diagram showing apparatus for automatic segmentation of non-aligned data sequences comprising structural domains to identify the structural domains and construct models thereof, according to a first preferred embodiment of the present invention. Apparatus 10 comprises a soft clustering unit 12, a refinement unit 14 connected thereto, an annealing unit 16 also connected to the soft clustering unit 12, and an output unit 18.

[0045] The soft clustering unit 12 carries out two tasks. Firstly, it iteratively partitions the data sequences and trains a plurality of variable memory Markov sources thereon to reach a state of convergence. Secondly, it eliminates sources showing low relationships with the data.

[0046] The refinement unit 14 splits and perturbs the sources following convergence, and returns them to the soft clustering unit for further iterative partitioning and eliminating. The perturbed sources provide an opportunity for better convergence.

[0047] The annealing unit 16 successively increases the resolution with which the relationships between data and sources are shown. As this resolution increases progressively, the elimination stage becomes more discriminating and the sources that remain after elimination become better and better models in a process of natural selection.

[0048] The output unit 18 outputs the remaining variable memory Markov sources. Provided that the natural selection has been carried to a sufficient extent, the sources that remain are models or electronic signatures for actual structural features within the source material. In the case of proteins the structural features are domains, as will be explained in greater detail below.

[0049] As discussed above, the present embodiments apply a powerful extension of the VMM model and the PST algorithm, recently developed for stochastic mixtures of such models (Seldin, Y., Bejerano, G. and Tishby, N. (2001) Unsupervised sequence segmentation by a mixture of switching variable memory Markov sources. Proc. 18th Intl. Conf. Mach. Learn. (ICML). Morgan Kaufmann, San Francisco, Calif., pp. 513-520, the contents of which are hereby incorporated by reference), that are able to learn in a hierarchical way using a Deterministic Annealing (DA) approach (Rose, 1998). The model can in fact be viewed as an HMM with a VMM attached to each state, but the learning algorithm allows a completely adaptive structure and topology both for each state and for the whole model. The present embodiments are information theoretic in nature. The goal is to enable a short description of the data by a (soft) mixture of VMM models, where the complexity of each model is controlled by the data via the Minimum Description Length (MDL) principle (reviewed in Barron, A., Rissanen, J. and Yu, B. (1998) The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theor., 44, 2743-2760). In effect the embodiments cluster regions of the input sequences into groups sharing coherent statistics. A PST model is grown for each group of segments, the model being as complex as the group is statistically rich. The clustering is then refined by letting the PSTs compete over the segments. Embedding the competitive learning in a DA framework allows the embodiments to attempt to infer the correct number of underlying sources and to avoid many local minima. The output of the algorithm of the preferred embodiment is a set of PST models, each of which is specialized in recognizing a certain protein region. The models can then be used to detect these regions in any protein.

[0050] In Seldin, Y., Bejerano, G. and Tishby, N. (2001) Unsupervised sequence segmentation by a mixture of switching variable memory Markov sources. Proc. 18th Intl. Conf. Mach. Learn. (ICML). Morgan Kaufmann, San Francisco, Calif., pp. 513-520, the contents of which are hereby incorporated by reference, the present inventors tested an embodiment of the algorithm on a mixture of interchanged running texts in five different European languages. The model was able to identify both the correct number of languages and the segmentation of the text sequence between the languages to within a few letters' precision. Note that the segmentation there was not based on conserved regions (say, a few sentences, each repeating several times with minor variations), but rather on the conserved statistics of running text segments in each language. In the present embodiments, statistical conservation is observed in the context of protein sequences.

[0051] There are clear advantages to the approach of the present embodiments compared to the common methods used for protein sequence segmentation. The method is automatic, there is no need for an alignment, and the motifs themselves need not be few, abundant, or linearly ordered. When a signature is identified in a protein, its statistical significance can be quantitatively evaluated through the likelihood the model assigns to it. Given a group of related sequences, the computational scheme of the present embodiments facilitates the segmentation of these sequences into domains through the use of the resulting statistical signatures, at times surpassing the sensitivity of single whole-domain HMMs. By characterizing protein families using these modular signatures it is possible to assign functional annotations to proteins that contain these modules, independent of their order in the protein. The detection of functional domains can then be used to define families and super-family hierarchies.

[0052] The examples section below shows an analysis of promising results obtained for three exemplary diverse protein families (Pax, Type II DNA Topoisomerases and GST) and compares these results with those of an alignment-based approach.

[0053] Several works precede the approach followed herein. Learning a single VMM from a group of sequences using a PST model is defined in Ron et al. (1996). Strong theoretical results backing this approach when the underlying source exhibits Markovian-like properties are given in Ron et al. (1996) and Buhlmann and Wyner (1999). Equivalent algorithms of optimal linear time and space complexity for PST learning and prediction are proven in Apostolico and Bejerano (2000). In Bejerano and Yona (2001) partial groups of unaligned sequences from diverse protein families are each used as training sets. The resulting PSTs are shown to distinguish between previously unseen family members and unrelated proteins, matching in sensitivity an HMM trained on an MSA of the input sequences, while being much faster. Also noted there (see FIGS. 5 and 6 of Bejerano and Yona, 2001), when plotting the prediction along every residue of a protein sequence, is a correlation between protein domains and the regions that the family PST recognizes best within family members. That observation motivated the current work. The algorithmic approach of the present embodiments extends PST learning from single source modeling to several competing models, each specializing in regions of coherent statistics.

[0054] A statistical model T is considered, which assigns a probability P_(T)(x) to a protein sequence x=x₁ . . . x_(l), where the x_(j) are members of the amino acid alphabet Σ. The higher the assigned probability P_(T)(x) that the model gives, the greater is our confidence that x belongs to the protein type modeled by T. The amino acids x₁ . . . x_(l) are treated as a sequence of dependent random variables and PST modeling is built around the Markovian approximation $P_{T}(x) = \prod_{j=1}^{l} P_{T}\left(x_{j} \mid x_{1} \ldots x_{j-1}\right) \approx \prod_{j=1}^{l} P_{T}\left(x_{j} \mid \mathrm{suf}_{T}\left(x_{1} \ldots x_{j-1}\right)\right)$

[0055] where the equality follows from applying the chain rule and suf_(T) (x₁ . . . x_(j−1)) is the longest suffix of x₁ . . . x_(j−1) memorized by T during training.

[0056] Reference is now made to FIG. 2, which is an example of a PST over the alphabet Σ={a,b,c,d,r}. The string inside each node is a memorized suffix and the adjacent vector is its probability distribution over the next symbol. A PST T is thus a data structure holding a set of short context specific probability vectors of the form P_(T)(·|x_(j−d) . . . x_(j−1)). Short patterns of arbitrary lengths are collected from training sequences regardless of the relative sequence positions of the different instances of each pattern.
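The longest-suffix prediction described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a PST is stored as a dictionary from memorized suffix to next-symbol distribution, with the empty string as the root context. The names `TOY_PST`, `longest_suffix` and `log_likelihood` are hypothetical, and the probabilities are invented rather than taken from FIG. 2.

```python
import math

# Hypothetical toy PST over the alphabet {a, b, c, d, r}: each key is a
# memorized suffix, each value is a distribution over the next symbol.
# The numbers are made up for illustration only.
TOY_PST = {
    "":   {"a": 0.4, "b": 0.2, "c": 0.1, "d": 0.1, "r": 0.2},
    "a":  {"a": 0.1, "b": 0.5, "c": 0.2, "d": 0.1, "r": 0.1},
    "ra": {"a": 0.1, "b": 0.1, "c": 0.6, "d": 0.1, "r": 0.1},
}


def longest_suffix(pst, history):
    """Return the longest suffix of `history` that the PST has memorized."""
    for start in range(len(history) + 1):
        suffix = history[start:]
        if suffix in pst:
            return suffix
    raise KeyError("the PST must contain the empty root context")


def log_likelihood(pst, sequence):
    """Sum of log P_T(x_j | suf_T(x_1 ... x_{j-1})) over the sequence."""
    total = 0.0
    for j, symbol in enumerate(sequence):
        context = longest_suffix(pst, sequence[:j])
        total += math.log(pst[context][symbol])
    return total
```

For example, `longest_suffix(TOY_PST, "bra")` selects the context `"ra"`, so the next symbol is predicted from the deepest memorized node rather than from the full history.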

[0057] As explained in Seldin et al. (2001), an MDL based variant of the PST learning algorithm is defined, which is non-parametric and self-regularizing. It allows the PST to grow to a level of complexity proportional to the statistical richness in the sequence it models. As input it takes a collection of protein sequences {x₁ . . . x_(n)} and a set of weight vectors {w₁ . . . w_(n)}, where the jth entry of w_(i), denoted 0≦w_(ij)≦1, measures the degree of relatedness currently assigned between the jth element of x_(i), x_(ij), and the model it is intended to train. For example, in order to train a PST only on specific regions in the proteins, one may assign w_(ij)=1 to those specific regions and w_(ij)=0 elsewhere.
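The role of the weights in training can be sketched as follows. This Python fragment only accumulates weighted next-symbol counts for every context up to a fixed depth. The MDL based growth and pruning of the tree is deliberately omitted, and the function name `weighted_context_counts` and the `max_depth` parameter are assumptions made for illustration, not part of the described algorithm.

```python
from collections import defaultdict


def weighted_context_counts(sequences, weights, max_depth):
    """Accumulate weighted next-symbol counts per context.

    sequences: list of strings x_i
    weights:   list of lists, weights[i][j] = w_ij in [0, 1]
    max_depth: longest context length considered (an assumed cap)
    """
    counts = defaultdict(lambda: defaultdict(float))
    for x, w in zip(sequences, weights):
        for j, symbol in enumerate(x):
            # Contexts of length 0 (root) up to max_depth ending just before j.
            for d in range(min(max_depth, j) + 1):
                context = x[j - d:j]
                counts[context][symbol] += w[j]
    return counts
```

Setting `weights[i][j]` to 1 inside a region and 0 elsewhere means that only symbols in that region contribute to the counts, which mirrors the region-specific training example given above.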

[0058] The degree of relatedness between a PST model and a sequence segment is defined as the probability the model assigns to the segment, which is to say how well the model predicts the segment. In order to partition the sequence between k=1 . . . m known PST models, one assigns sequence segments from the collection to the models in proportion to the degree of relatedness between a segment and each of the models being used. The result is a series of nm vectors $\{\bar{w}_{i}^{k}\}_{i,k}$

[0059] each representing the prediction by one model of one sequence. The vectors therefore constitute a soft partitioning of the sequence collection between the models ($\forall i, j: \sum_{k} w_{ij}^{k} = 1$).

[0060] Each model k may then be retrained using a new set of weights $\{\bar{w}_{i}^{k}\}_{i}$.

[0061] Such soft clustering (data repartition followed by model retraining) can be iterated until convergence to a set of PSTs, each one of which models a distinct group of sequence segments. The loop is similar to the iterative loop that is used in soft clustering of points in R^(n) to k Gaussians.

[0062] The quality of the solution that is converged to depends on the number of models and their initial settings. Both issues are solved using iterative refinement. In iterative refinement one begins with a single model T₀ which has been trained over the entire collection of sequences. T₀ is then split into two identical replicas T₁ and T₂, which are randomly perturbed so that they differ slightly. Repartitioning and training are then repeated and, when the perturbed models converge on a new solution, splitting is repeated. Models that lose their grip on the data during the course of the repartitioning, splitting and training process are eliminated.
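The splitting step may be pictured with the following Python sketch. The text states only that the two replicas are randomly perturbed so that they differ slightly. The multiplicative noise scheme, the `noise` and `seed` parameters, and the dictionary representation of a model (context mapped to next-symbol distribution) are all assumptions chosen for illustration.

```python
import random


def perturb(distribution, noise, rng):
    """Scale each probability by a random factor near 1, then renormalize."""
    noisy = {
        symbol: p * (1.0 + noise * rng.uniform(-1.0, 1.0))
        for symbol, p in distribution.items()
    }
    total = sum(noisy.values())
    return {symbol: p / total for symbol, p in noisy.items()}


def split_model(pst, noise=0.05, seed=0):
    """Return two slightly different replicas of `pst` (assumes noise < 1)."""
    rng = random.Random(seed)
    return tuple(
        {context: perturb(dist, noise, rng) for context, dist in pst.items()}
        for _ in range(2)
    )
```

Each replica keeps the same set of contexts as the parent model, and every perturbed vector remains a valid probability distribution, so both replicas can be passed straight back into repartitioning and retraining.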

[0063] Finally, a resolution parameter β>0 is introduced and is gradually increased from a low initial value. The parameter β controls the hardness of the soft partition of sequence segments between the models. As β increases, segments separate more and more into distinct models.

[0064] Formally, the process sets $w_{ij}^{k} = \frac{P(T_{k})\, e^{\beta S_{T_{k}}(x_{ij})}}{\sum_{\alpha=1}^{m} P(T_{\alpha})\, e^{\beta S_{T_{\alpha}}(x_{ij})}}$

[0065] where S_(T) _(k) (x_(ij))≦0 is a log-likelihood measure of relatedness between model k and symbol x_(ij) and P(T_(k)) corresponds to the relative amount of data assigned to model k in the previous segmentation. As β increases it induces a sharper distinction between the highest scoring S_(T) _(k) (x_(ij)) and the other models for each x_(ij). The above described procedure may avoid many local minima and generally yields better solutions than other optimization algorithms. Reference is made to FIG. 3 which is a simplified flow chart illustrating the above described sequence. A schematic representation is shown in FIG. 4.
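The effect of the resolution parameter β on the partition can be illustrated with a short Python sketch. It assumes that the weights take the usual deterministic annealing form, in which each model's prior P(T_(k)) is multiplied by exp(β·S) and the result is normalized over the models. The function name `soft_assignment` and its argument layout are hypothetical.

```python
import math


def soft_assignment(scores, priors, beta):
    """Weights w_ij^k over models k for a single symbol x_ij.

    scores: scores[k] = S_{T_k}(x_ij), log-likelihood scores (<= 0)
    priors: priors[k] = P(T_k), assumed strictly positive
    beta:   resolution parameter (>= 0)
    """
    exponents = [math.log(p) + beta * s for p, s in zip(priors, scores)]
    # Subtracting the maximum leaves the normalized result unchanged
    # and avoids underflow when beta is large.
    top = max(exponents)
    unnormalized = [math.exp(e - top) for e in exponents]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With equal priors and scores of -1 and -2, a small β gives a soft split between the two models, while a large β assigns almost all of the weight to the higher scoring model, which corresponds to the hardening of the partition described above.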

EXAMPLES

[0066] Several representative cases are analyzed below. A protein fusion event is identified, an HMM superfamily is classified into underlying families that the HMM cannot separate, and all 12 instances of a short domain in a group of 396 sequences are detected.

[0067] As discussed above, the input to the segmentation algorithm is a group of unaligned sequences in which to search for regions of one or more types of conserved statistics. In a first example of use of the present embodiments, different training sets were constructed using the Pfam (release 5.4) and Swissprot (release 38, Bairoch and Apweiler, 2000) databases. Various sequence domain families were collected from Pfam. In each Pfam family all members share a domain. An HMM detector is built for that domain based on an MSA of a seed subset of the family domain regions. The HMM is then verified to detect that domain in the remaining family members. Multi-domain proteins therefore belong to as many Pfam families as there are different characterized domains within them.

[0068] In order to build realistic, more heterogeneous sets, the present inventors collected from Swissprot the complete sequences of all chosen Pfam families. Each set now contains a certain domain in all its members, and possibly various other domains appearing anywhere within some members. Given such a set of unaligned sequences our algorithm returns as output several PST models (FIG. 3). The number of models returned is determined by the algorithm itself. Each such PST has survived repeated competitions by outperforming the other PSTs on some sequence regions. In practice two types of PSTs emerge for protein sequence data:

[0069] 1) models that significantly outperform others on relatively short regions (and generally perform poorly on most other regions), which are referred to hereinbelow as detectors, and

[0070] 2) models that perform averagely over all sequence regions. These are noise (baseline) models and are discarded automatically.

[0071] We now turn to analyze the detectors. It is necessary to determine in which sequences they outperform all other models, and what the correlation is between detected regions and protein domains. Several interesting results may be obtained from the analysis. First and foremost, the result may give a signature for the common domain or domains. Signatures for other domains that appear only in some proteins may also appear. Additionally, a signature may exactly cover a domain, revealing its boundaries.

[0072] When the Pfam HMM detector cannot model below the superfamily level, it may be possible to outperform it and subdivide into the underlying biological families.

[0073] Three of the Pfam-based sets we ran experiments on have been chosen to demonstrate examples covering all the above cases. The three, very different, domain families are the Pax proteins, the type II DNA Topoisomerases and the glutathione S-transferases. Thereafter, the results are compared with those of an MSA-based approach.

[0074] Ten independent runs of the (stochastic) segmentation algorithm, implemented in C++, were carried out per family. On a Pentium III 600 MHz Linux machine clear segmentation was usually apparent within an hour or two of run time. It is recalled that each PST detector examined is run over all complete sequences in the set it was grown on in order to determine its nature. In our experiments the signature left by each PST was the same between different runs, and between different proteins sharing the same domain(s). We therefore present only the output of all detector PSTs on representative sequences in a particular run.

[0075] 3.1 The Pax family

[0076] Pax proteins (reviewed in Stuart, E. T., Kioussi, C. and Gruss, P. (1994) Mammalian Pax genes. Annu. Rev. Genet., 28, 219-236) are eukaryotic transcriptional regulators that play critical roles in mammalian development and in oncogenesis. All of them contain a conserved domain of 128 amino acids called the paired or paired box domain (named after the Drosophila paired gene which is a member of the family). Some contain an additional homeobox domain that succeeds the paired domain. Pfam nomenclature names the paired domain PAX. The Pax proteins show a high degree of sequence conservation. One hundred and sixteen family members were used as a training set for the segmentation algorithm, as described above.

[0077] Reference is now made to FIG. 5, which shows Paired/PAX and homeobox signatures. We superimpose the log likelihood predictions log P_(T) of all four detector PSTs generated by the segmentation algorithm, and an exemplary baseline model (dashed), against the sequence of the PAX6 SS protein. The title holds the protein accession number. At the bottom we denote in Pfam nomenclature the location of the two experimentally verified domains. These are in near perfect match here with the high scoring sequence segments.

[0078] In FIG. 5 we superimpose the prediction of all resulting PST detectors over one representative family member. This Pax6 SS protein contains both the paired and homeobox domains. Both have matching signatures. This also serves as an example where the signatures exactly overlap the domains. The graph of family members not having the homeobox domain contains only the paired domain signature. Note that only about half the proteins contain the homeobox domain and yet its signature is very clear.

[0079] 3.2 DNA Topoisomerase II

[0080] Type II DNA topoisomerases are essential and highly conserved in all living organisms (see Roca, J. (1995) The mechanisms of DNA topoisomerases. Trends Biol. Chem., 20, 156-160, for a review). They catalyze the interconversion of topological isomers of DNA and are involved in a number of mechanisms, such as supercoiling and relaxation, knotting and unknotting, and catenation and decatenation. In prokaryotes the enzyme is represented by the Escherichia coli gyrase, which is encoded by two genes, gyrase A and gyrase B. The enzyme is a tetramer composed of two gyrA and two gyrB polypeptide chains. In eukaryotes the enzyme acts as a dimer, where in each monomer two distinct domains are observed. The N-terminal domain is similar in sequence to gyrase B and the C-terminal domain is similar in sequence to gyrase A (FIG. 9).

[0081] FIG. 9 is a simplified schematic diagram illustrating a protein fusion event and is adapted from Marcotte et al. (1999). The Pfam domain names are added in brackets, together with a reference to our results on a representative homolog. Comparing the PST signatures in FIGS. 6-8 with the schematic drawing of FIG. 9, it is clear that the eukaryotic signature is indeed composed of the two prokaryotic ones, in the correct order, omitting the C-terminus signature of gyrase B (abbreviated here as Gyr).

[0082] In Pfam 5.4 terminology gyrB and the N-terminal domain belong to the DNA topoisoII family, while gyrA and the C-terminal domain belong to the DNA topoisoIV family. Here we term the pairs gyrB/topoII and gyrA/topoIV. For the analysis we used a group of 164 sequences that included both eukaryotic topoisomerase II sequences and bacterial gyrase A and B sequences (gathered from the union of the DNA topoisoII and DNA topoisoIV Pfam 5.4 families). We successfully differentiate them into sub-classes. FIG. 6 describes a representative of the eukaryotic topoisomerase II sequences and shows the signatures for both domains, gyrB/topoII and gyrA/topoIV. FIGS. 7 and 8 demonstrate the results for representatives of the bacterial gyrase B and gyrase A proteins, respectively. The same two signatures are found in all three sequences, at the appropriate locations. Interestingly, in FIG. 7 in addition to the signature of the gyrB/topoII domain another signature appears at the C-terminal region of the sequence. This signature is compatible with a known conserved region at the C-terminus of gyrase B, that is involved in the interaction with the gyrase A molecule. The relationship between the E. coli proteins gyrA and gyrB and the yeast topoisomerase II (FIG. 9) provides a prototypical example of a fusion event of two proteins that form a complex in one organism into one protein that carries a similar function in another organism. Such examples have led to the idea that identification of such similarities may suggest the relationship between the first two proteins, either by physical interaction or by their involvement in a common pathway (Marcotte et al., 1999; Enright et al., 1999). The computational scheme we present can be useful in a search for these relationships.

[0083] 3.3 The Glutathione S-Transferases

[0084] The Glutathione S-Transferases (GST) represent a major group of detoxification enzymes (reviewed in Hayes, J. and Pulford, D. (1995) The glutathione S-transferase super-gene family: regulation of GST and the contribution of the isoenzymes to cancer chemoprotection and drug resistance. Crit. Rev. Biochem. Mol. Biol., 30, 445-600). There is evidence that the level of expression of GST is a crucial factor in determining the sensitivity of cells to a broad spectrum of toxic chemicals. All eukaryotic species possess multiple cytosolic GST isoenzymes, each of which displays distinct binding properties. A large number of cytosolic GST isoenzymes have been purified from rat and human organs and, on the basis of their sequences, they have been clustered into five separate classes designated class alpha, mu, pi, sigma, and theta GST. The hypothesis that these classes represent separate families of GST is supported by the distinct structure of their genes and their chromosomal location. The class terminology is deliberately global, attempting to include as many GSTs as possible. However, it is possible that there are sub-classes that are specific to a given organism or a group of organisms. In those sub-classes the proteins may share more than 90% sequence identity, but these relationships are masked by their inclusion in the more global class. Also, the classification of a GST protein with weak similarity to one of these classes is sometimes a difficult task. In particular, the definition of the sigma and theta classes is imprecise. Indeed, in the PRINTS database only the three classes alpha, pi, and mu have been defined by distinct sequence signatures, while in Pfam all GSTs are clustered together, for lack of sequence dissimilarity.

[0085] In the example, three hundred and ninety six Pfam family members were segmented jointly by our algorithm, and the results were compared to those of PRINTS (as Pfam classifies all as GSTs). Five distinct signatures were found (not shown due to space limitations):

[0086] (1) A typical weak signature common to many GST proteins that contain no sub-class annotation.

[0087] (2) A sharp peak after the end of the GST domain, appearing in exactly those 12 of the 396 proteins (3%) in which the Elongation Factor 1 Gamma (EF1G) domain succeeds the GST domain.

[0088] (3) A clear signature common to almost all PRINTS annotated alpha and most pi GSTs. The last two signatures require more knowledge of the GST superfamily.

[0089] (4) The theta and sigma classes, which are abundant in invertebrates. It is noted that, as more of these proteins are identified, additional classes are expected to be defined. The first evidence for a separate sigma class was obtained by sequence alignments of S-crystallins from mollusc lens tissue. Although refractory proteins in the lens probably do not have catalytic activity, they show a degree of sequence similarity to the GSTs that justifies their inclusion in this family and their classification as a separate sigma class (Buetler, T. and Eaton, D. (1992) Glutathione S-transferases: amino acid sequence comparison, classification and phylogenetic relationship. Environ. Carcinogen. Ecotoxicol. Rev., C10, 181-203). This class, defined in PRINTS as S-crystallin, was almost entirely identified by the fourth distinct signature.

[0090] (5) Interestingly, the last distinct signature found is composed of two detector models, one from each of the previous two signatures (alpha/pi and S-crystallin). Most of these two dozen proteins come from insects, and of these most are annotated as belonging to the theta class. Note that many of the GSTs in insects are known to be only very distantly related to the five mammalian classes. This putative theta sub-class, the previous signatures and the undetected PRINTS mu sub-class are all currently being further investigated.
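The grouping of proteins by the signatures listed above, including a composite signature in which two detector models fire on the same protein, may be sketched as follows. This is a minimal illustration only. The protein identifiers, the detector names and the function group_by_signature are hypothetical, and the decision of whether a detector "fires" on a protein is assumed to have been made beforehand.

```python
from collections import defaultdict


def group_by_signature(hits):
    """Group proteins by the set of detectors that fired on them.

    hits: dict mapping a protein id to an iterable of detector names.
    Returns a dict mapping each frozenset of detector names to the
    sorted list of proteins showing exactly that combination.
    """
    groups = defaultdict(list)
    for protein, detectors in hits.items():
        groups[frozenset(detectors)].append(protein)
    return {sig: sorted(members) for sig, members in groups.items()}


# Hypothetical detector hits for four proteins.
hits = {
    "p1": ["alpha_pi"],
    "p2": ["s_crystallin"],
    "p3": ["alpha_pi", "s_crystallin"],
    "p4": ["s_crystallin", "alpha_pi"],
}
groups = group_by_signature(hits)
```

Under this representation a composite signature is simply a key containing more than one detector name, so it is kept apart from the groups in which either detector fires alone.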

[0091] 3.4 Comparative Results

[0092] In order to evaluate the above findings we have performed three unsupervised alignment-driven experiments using the same sets described above: an MSA was computed for each set using Clustal X (Linux version 1.81, Jeanmougin et al., 1998). We let Clustal X compare the level of conservation between individual sequences and the computed MSA profile in each set. Qualitatively these graphs resemble ours, apart from the fact that they do not offer separation into distinct models. As expected, this straightforward approach yields less. We briefly recount some results.

[0093] Reference is now made to FIG. 10 which shows Pax MSA profile conservation. We plot the Clustal X conservation score of the PAX6 SS protein against an MSA of all Pax proteins. While the predominant paired/PAX domain is discerned, the homeobox domain (appearing in about half the sequences) is lost in the background noise. The results are to be compared with FIG. 5 where the same training set and plotted sequence are used.

[0094] The Pax alignment did not clearly elucidate the homeobox domain existing in about half the sequences. As a result, when plotting the graph comparing the same PAX6 SS protein we used in FIG. 5 against the new MSA in FIG. 10, the homeobox signal is lost in the noise.
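For illustration, a per-residue conservation profile of one sequence against an MSA may be computed along the following lines. This sketch does not reproduce the Clustal X scoring. It uses a deliberately simple measure, the fraction of aligned sequences that carry the same residue as the query in each column, and the function name and the toy alignment are assumptions.

```python
def conservation_profile(msa, query_index, gap="-"):
    """Fraction of rows matching the query residue, per query residue.

    msa: list of equal-length aligned strings.
    Columns where the query itself has a gap are skipped, so the
    profile has one value per residue of the query sequence.
    """
    query = msa[query_index]
    profile = []
    for col, residue in enumerate(query):
        if residue == gap:
            continue
        matches = sum(1 for row in msa if row[col] == residue)
        profile.append(matches / len(msa))
    return profile


# Toy alignment: the last two columns are present in only some rows.
toy_msa = ["ACDE", "ACD-", "AC--", "GC--"]
profile = conservation_profile(toy_msa, 0)
```

With this measure, a column in which only half of the aligned sequences carry a residue cannot score above one half, however well conserved that residue is among the sequences that do carry it.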

[0095] For type II topoisomerases the picture is slightly better. The Gyrase B C-terminus unit from FIG. 7 can be discerned from the main unit, but with a much lower peak. However, the clear sum of two signatures we obtained for the eukaryotic sequences (FIG. 6) is lost here. In the last and hardest case, the MSA approach tells us nothing. All GST domain graphs look nearly identical, precluding any possible subdivision, and the 12 (out of 396) instances of the EF1G domain are completely lost at the alignment phase.

[0096] It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination.

[0097] It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention is defined by the appended claims and includes both combinations and subcombinations of the various features described hereinabove as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description.

1 4 1 409 PRT Cynops pyrrhogaster 1 Met Arg Asp Tyr Ile Arg Glu Thr Gln Gly Ile Ala Leu Glu Gln Phe 1 5 10 15 Asn Met Gln Asn Ser His Ser Gly Val Asn Gln Leu Gly Gly Val Phe 20 25 30 Val Asn Gly Arg Pro Leu Pro Asp Ser Thr Arg Gln Lys Ile Val Glu 35 40 45 Leu Ala His Ser Gly Ala Arg Pro Cys Asp Ile Ser Arg Ile Leu Gln 50 55 60 Val Ser Asn Gly Cys Val Ser Lys Ile Leu Gly Arg Tyr Tyr Glu Thr 65 70 75 80 Gly Ser Ile Arg Pro Arg Ala Ile Gly Gly Ser Lys Pro Arg Val Ala 85 90 95 Thr Pro Glu Val Val Ser Lys Ile Ala Gln Tyr Lys Arg Glu Cys Pro 100 105 110 Ser Ile Phe Ala Trp Glu Ile Arg Asp Arg Leu Leu Ser Glu Gly Val 115 120 125 Cys Thr Asn Asp Asn Ile Pro Ser Val Ser Ser Ile Asn Arg Val Leu 130 135 140 Arg Asn Leu Ala Ser Glu Lys Gln Gln Met Gly Ala Asp Gly Met Tyr 145 150 155 160 Asp Lys Leu Arg Met Leu Asn Gly Gln Thr Gly Thr Trp Gly Thr Arg 165 170 175 Pro Gly Trp Tyr Pro Gly Thr Ser Val Pro Gly Gln Pro Thr Pro Asp 180 185 190 Gly Cys Gln Gln Gln Glu Gly Gly Gly Glu Asn Thr Asn Ser Ile Ser 195 200 205 Ser Asn Gly Glu Asp Ser Asp Glu Ala Gln Met Arg Leu Gln Leu Lys 210 215 220 Arg Lys Leu Gln Arg Asn Arg Thr Ser Phe Thr Gln Glu Gln Ile Glu 225 230 235 240 Ala Leu Glu Lys Glu Phe Glu Arg Thr His Tyr Pro Asp Val Phe Ala 245 250 255 Arg Glu Arg Leu Ala Ala Lys Ile Asp Leu Pro Glu Ala Arg Ile Gln 260 265 270 Val Trp Phe Ser Asn Arg Arg Ala Lys Trp Arg Arg Glu Glu Lys Leu 275 280 285 Arg Asn Gln Arg Arg Gln Ala Ser Asn Thr Pro Ser His Ile Pro Ile 290 295 300 Ser Ser Ser Phe Ser Thr Ser Val Tyr Gln Pro Ile Pro Gln Pro Thr 305 310 315 320 Thr Pro Val Ser Phe Thr Ser Gly Ser Met Leu Gly Arg Thr Asp Thr 325 330 335 Ser Leu Thr Asn Thr Tyr Gly Gly Leu Pro Pro Met Pro Ser Phe Thr 340 345 350 Met Gly Asn Asn Leu Pro Met Gln Val Ser Phe Pro Leu Glu Cys Gln 355 360 365 Ser Gln Tyr Lys Phe Pro Ala Val Asn Leu Thr Cys Leu Asn Thr Gly 370 375 380 Gln Asp Tyr Ser Lys Asn Arg Ala Asn Ile Ala Asn Asp Phe Val Glu 385 390 395 400 Asn Ser Trp Met Phe Ser Ser Ile Leu 405 2 1526 PRT Cricetulus 
longicaudatus 2 Met Glu Leu Ser Pro Leu Gln Pro Val Asn Glu Asn Met Gln Met Asn 1 5 10 15 Lys Lys Lys Asn Glu Asp Ala Lys Lys Arg Leu Ser Ile Glu Arg Ile 20 25 30 Tyr Gln Lys Lys Thr Gln Leu Glu His Ile Leu Leu Arg Pro Asp Thr 35 40 45 Tyr Ile Gly Ser Val Glu Leu Val Thr Gln Gln Met Trp Val Tyr Asp 50 55 60 Glu Asp Val Gly Ile Asn Tyr Arg Glu Val Thr Phe Val Pro Gly Leu 65 70 75 80 Tyr Lys Ile Phe Asp Glu Ile Leu Val Asn Ala Ala Asp Asn Lys Gln 85 90 95 Arg Asp Pro Lys Met Ser Cys Ile Arg Val Thr Ile Asp Pro Glu Asn 100 105 110 Asn Leu Ile Ser Ile Trp Asn Asn Gly Lys Gly Ile Pro Val Val Glu 115 120 125 His Lys Val Glu Lys Met Tyr Val Pro Ala Leu Ile Phe Gly Gln Leu 130 135 140 Leu Thr Ser Ser Asn Tyr Asp Asp Asp Glu Lys Lys Val Thr Gly Gly 145 150 155 160 Arg Asn Gly Tyr Gly Ala Lys Leu Cys Asn Ile Phe Ser Thr Arg Phe 165 170 175 Thr Val Glu Thr Ala Ser Lys Glu Tyr Lys Lys Met Phe Lys Gln Thr 180 185 190 Trp Met Asp Asn Met Gly Arg Ala Gly Asp Met Glu Leu Lys Pro Phe 195 200 205 Asn Gly Glu Asp Tyr Thr Cys Ile Thr Phe Gln Pro Asp Leu Ser Lys 210 215 220 Phe Lys Met Gln Ser Leu Asp Lys Asp Ile Val Ala Leu Met Val Arg 225 230 235 240 Arg Ala Tyr Asp Ile Ala Gly Ser Thr Lys Asp Val Lys Val Phe Leu 245 250 255 Asn Gly Asn Lys Leu Pro Val Lys Gly Phe Arg Ser Tyr Val Asp Met 260 265 270 Tyr Leu Lys Asp Lys Leu Asp Glu Thr Gly Asn Ala Leu Lys Val Val 275 280 285 His Glu Gln Val Asn Pro Arg Trp Glu Val Cys Leu Thr Met Ser Glu 290 295 300 Lys Gly Phe Gln Gln Ile Ser Phe Val Asn Ser Ile Ala Thr Ser Lys 305 310 315 320 Gly Gly Arg His Val Asp Tyr Val Ala Asp Gln Ile Val Ser Lys Leu 325 330 335 Val Asp Val Val Lys Lys Lys Asn Lys Gly Gly Val Ala Val Lys Ala 340 345 350 His Gln Val Lys Asn His Met Trp Ile Phe Val Asn Ala Leu Ile Glu 355 360 365 Asn Pro Ser Phe Asp Ser Gln Thr Lys Glu Asn Met Thr Leu Gln Ala 370 375 380 Lys Ser Phe Gly Ser Thr Cys Gln Leu Ser Glu Lys Phe Ile Lys Ala 385 390 395 400 Ala Ile Gly Cys Gly Ile Val Glu Ser Ile Leu Asn Trp Val Lys Phe 405 410 415 Lys 
Ala Gln Ile Gln Leu Asn Lys Lys Cys Ser Ala Val Lys His Asn 420 425 430 Arg Ile Lys Gly Ile Pro Lys Leu Asp Asp Ala Asn Asp Ala Gly Ser 435 440 445 Arg Asn Ser Thr Glu Cys Thr Leu Ile Leu Thr Glu Gly Asp Ser Ala 450 455 460 Lys Thr Leu Ala Val Ser Gly Leu Gly Val Val Gly Arg Asp Lys Tyr 465 470 475 480 Gly Val Phe Pro Leu Arg Gly Lys Ile Leu Asn Val Arg Glu Ala Ser 485 490 495 His Lys Gln Ile Met Glu Asn Ala Glu Ile Asn Asn Ile Ile Lys Ile 500 505 510 Val Gly Leu Gln Tyr Lys Lys Asn Tyr Glu Asp Glu Asp Ser Leu Lys 515 520 525 Thr Leu Arg Tyr Gly Lys Ile Met Ile Met Thr Asp Gln Asp Gln Asp 530 535 540 Gly Ser His Ile Lys Gly Leu Leu Ile Asn Phe Ile His His Asn Trp 545 550 555 560 Pro Ser Leu Leu Arg His Arg Phe Leu Glu Glu Phe Ile Thr Pro Ile 565 570 575 Val Lys Val Ser Lys Asn Lys Gln Glu Leu Ala Phe Tyr Ser Leu Pro 580 585 590 Glu Phe Glu Glu Trp Lys Ser Ser Thr Pro Asn His Lys Lys Trp Lys 595 600 605 Val Lys Tyr Tyr Lys Gly Leu Gly Thr Ser Thr Ser Lys Glu Ala Lys 610 615 620 Glu Tyr Phe Ala Asp Met Lys Arg His Arg Ile Gln Phe Lys Tyr Ser 625 630 635 640 Gly Pro Glu Asp Asp Ala Ala Ile Ser Leu Ala Phe Ser Lys Lys Gln 645 650 655 Val Asp Asp Arg Lys Glu Trp Leu Thr His Phe Met Glu Asp Arg Arg 660 665 670 Gln Arg Lys Leu Leu Gly Leu Pro Glu Asp Tyr Leu Tyr Gly Gln Thr 675 680 685 Thr Thr Tyr Leu Thr Tyr Asn Asp Phe Ile Asn Lys Glu Leu Ile Leu 690 695 700 Phe Ser Asn Ser Asp Asn Glu Arg Ser Ile Pro Ser Met Val Asp Gly 705 710 715 720 Leu Lys Pro Gly Gln Arg Lys Val Leu Phe Thr Cys Phe Lys Arg Asn 725 730 735 Asp Lys Arg Glu Val Lys Val Ala Gln Leu Ala Gly Ser Val Gly Glu 740 745 750 Met Ser Ser Tyr His His Gly Glu Met Ser Leu Met Met Thr Ile Ile 755 760 765 Asn Leu Ala Gln Asn Phe Val Gly Ser Asn Asn Leu Asn Leu Leu Gln 770 775 780 Pro Ile Gly Gln Phe Gly Thr Arg Leu His Gly Gly Lys Asp Ser Ala 785 790 795 800 Ser Pro Arg Tyr Ile Phe Thr Met Leu Ser Pro Leu Thr Arg Leu Leu 805 810 815 Phe Pro Pro Lys Asp Asp His Thr Leu Lys Phe Leu Tyr Asp Asp Asn 820 825 830 Gln Arg 
Val Glu Pro Glu Trp Tyr Ile Pro Ile Ile Pro Met Val Leu 835 840 845 Ile Asn Gly Ala Glu Gly Ile Gly Thr Gly Trp Ser Cys Lys Thr Pro 850 855 860 Asn Phe Asp Ile Arg Glu Val Val Asn Asn Ile Arg Arg Leu Leu Asp 865 870 875 880 Gly Glu Glu Pro Leu Pro Met Leu Pro Ser Tyr Lys Asn Phe Lys Gly 885 890 895 Thr Ile Glu Glu Leu Ala Ser Asn Gln Tyr Val Ile Asn Gly Glu Val 900 905 910 Ala Ile Leu Asn Ser Thr Thr Ile Glu Ile Ser Glu Leu Pro Ile Arg 915 920 925 Thr Trp Thr Gln Thr Tyr Lys Glu Gln Val Leu Glu Pro Met Leu Asn 930 935 940 Gly Thr Glu Lys Thr Pro Pro Leu Ile Thr Asp Tyr Arg Glu Tyr His 945 950 955 960 Thr Asp Thr Thr Val Lys Phe Val Ile Lys Met Thr Glu Glu Lys Leu 965 970 975 Ala Glu Ala Glu Arg Val Gly Leu His Lys Val Phe Lys Leu Gln Thr 980 985 990 Ser Leu Thr Cys Asn Ser Met Val Leu Phe Asp His Val Gly Cys Leu 995 1000 1005 Lys Lys Tyr Asp Thr Val Leu Asp Ile Leu Lys Asp Phe Phe Glu 1010 1015 1020 Leu Arg Leu Lys Tyr Tyr Gly Leu Arg Lys Glu Trp Leu Leu Gly 1025 1030 1035 Met Leu Gly Ala Glu Ser Ala Lys Leu Asn Asn Gln Ala Arg Phe 1040 1045 1050 Ile Leu Glu Lys Ile Asp Gly Lys Ile Ile Ile Glu Asn Lys Pro 1055 1060 1065 Lys Lys Glu Leu Ile Lys Val Leu Ile Gln Arg Gly Tyr Asp Ser 1070 1075 1080 Asp Pro Val Lys Ala Trp Lys Glu Ala Gln Gln Lys Val Pro Asp 1085 1090 1095 Glu Glu Glu Asn Glu Glu Ser Asp Asn Glu Asn Ser Asp Ser Val 1100 1105 1110 Ala Glu Ser Gly Pro Thr Phe Asn Tyr Leu Leu Asp Met Pro Leu 1115 1120 1125 Trp Tyr Leu Thr Lys Glu Lys Lys Asp Glu Leu Cys Lys Gln Arg 1130 1135 1140 Asn Glu Lys Glu Gln Glu Leu Asn Thr Leu Lys Asn Lys Ser Pro 1145 1150 1155 Ser Asp Leu Trp Lys Glu Asp Leu Ala Val Phe Ile Glu Glu Leu 1160 1165 1170 Glu Val Val Glu Ala Lys Glu Lys Gln Asp Glu Gln Val Gly Leu 1175 1180 1185 Pro Gly Lys Gly Gly Lys Ala Lys Gly Lys Lys Ala Gln Met Ser 1190 1195 1200 Glu Val Leu Pro Ser Pro His Gly Lys Arg Val Ile Pro Gln Val 1205 1210 1215 Thr Met Glu Met Lys Ala Glu Ala Glu Lys Lys Ile Arg Lys Lys 1220 1225 1230 Ile Lys Ser Glu Asn Val Glu Gly Thr 
Pro Thr Glu Asn Gly Leu 1235 1240 1245 Glu Leu Gly Ser Leu Lys Gln Arg Ile Glu Lys Lys Gln Lys Lys 1250 1255 1260 Glu Pro Gly Ala Met Thr Lys Lys Gln Thr Thr Leu Ala Phe Lys 1265 1270 1275 Pro Ile Lys Lys Gly Lys Lys Arg Asn Pro Trp Ser Asp Ser Glu 1280 1285 1290 Ser Asp Met Ser Ser Asn Glu Ser Asn Val Asp Val Pro Pro Arg 1295 1300 1305 Glu Lys Asp Pro Arg Arg Ala Ala Thr Lys Ala Lys Phe Thr Met 1310 1315 1320 Asp Leu Asp Ser Asp Glu Asp Phe Ser Gly Ser Asp Gly Lys Asp 1325 1330 1335 Glu Asp Glu Asp Phe Phe Pro Leu Asp Thr Thr Pro Pro Lys Thr 1340 1345 1350 Lys Ile Pro Gln Lys Asn Thr Lys Lys Ala Leu Lys Pro Gln Lys 1355 1360 1365 Ser Ala Met Ser Gly Asp Pro Glu Ser Asp Glu Lys Asp Ser Val 1370 1375 1380 Pro Ala Ser Pro Gly Pro Pro Ala Ala Asp Leu Pro Ala Asp Thr 1385 1390 1395 Glu Gln Leu Lys Pro Ser Ser Lys Gln Thr Val Ala Val Lys Lys 1400 1405 1410 Thr Ala Thr Lys Ser Gln Ser Ser Thr Ser Thr Ala Gly Thr Lys 1415 1420 1425 Lys Arg Ala Val Pro Lys Gly Ser Lys Ser Asp Ser Ala Leu Asn 1430 1435 1440 Ala His Gly Pro Glu Lys Pro Val Pro Ala Lys Ala Lys Asn Ser 1445 1450 1455 Arg Lys Arg Lys Gln Ser Ser Ser Asp Asp Ser Asp Ser Asp Phe 1460 1465 1470 Glu Lys Val Val Ser Lys Val Ala Ala Ser Lys Lys Ser Lys Gly 1475 1480 1485 Glu Asn Gln Asp Phe Arg Val Asp Leu Asp Glu Thr Met Val Pro 1490 1495 1500 Arg Ala Lys Ser Gly Arg Ala Lys Lys Pro Ile Lys Tyr Leu Glu 1505 1510 1515 Glu Ser Asp Asp Asp Asp Leu Phe 1520 1525 3 426 PRT Human herpesvirus 6 3 Leu Gln Ser Val Phe Ala Phe Leu His Glu Lys Ile Phe Gly Val Tyr 1 5 10 15 Lys Gln Val Leu Val Gln Leu Cys Glu Tyr Ile Gly Pro Asp Leu Trp 20 25 30 Pro Phe Gly Asn Glu Arg Ser Val Ser Phe Ile Gly Tyr Pro Asn Leu 35 40 45 Trp Leu Leu Ser Val Ser Asp Leu Glu Arg Arg Val Pro Asp Thr Thr 50 55 60 Tyr Ile Cys Arg Glu Ile Leu Ser Phe Cys Gly Leu Ala Pro Ile Leu 65 70 75 80 Gly Pro Arg Gly Arg His Ala Ile Pro Val Ile Arg Glu Leu Ser Val 85 90 95 Glu Met Pro Gly Ser Glu Thr Ser Leu Gln Arg Phe Arg Phe Asn Ser 100 105 110 Gln Tyr Val Ser 
Ser Glu Ser Leu Cys Phe Gln Thr Gly Pro Glu Asp 115 120 125 Thr His Leu Phe Phe Ser Asp Ser Asp Met Tyr Val Val Thr Leu Pro 130 135 140 Asp Cys Leu Arg Leu Leu Leu Lys Ser Thr Val Pro Arg Ala Phe Leu 145 150 155 160 Pro Cys Phe Asp Glu Asn Ala Thr Glu Ile Glu Leu Leu Leu Lys Phe 165 170 175 Met Ser Arg Leu Gln His Arg Ser Tyr Ala Leu Phe Asp Ala Val Ile 180 185 190 Phe Met Leu Asp Ala Phe Val Ser Ala Phe Gln Arg Ala Cys Thr Leu 195 200 205 Met Glu Met Arg Trp Leu Leu Val Arg Asp Leu His Val Phe Tyr Leu 210 215 220 Thr Cys Asp Gly Lys Asp Ser His Val Val Met Pro Leu Leu Gln Thr 225 230 235 240 Ala Val Glu Asn Cys Trp Glu Lys Ile Thr Glu Ile Lys Gln Arg Pro 245 250 255 Ala Phe Gln Cys Met Glu Ile Ser Arg Cys Gly Phe Val Phe Tyr Ala 260 265 270 Arg Phe Phe Leu Ser Ser Gly Leu Ser Gln Ser Lys Glu Ala His Trp 275 280 285 Thr Val Thr Ala Ser Lys Tyr Leu Ser Ala Cys Ile Arg Ala Asn Lys 290 295 300 Thr Gly Leu Cys Phe Ala Ser Ile Thr Val Tyr Phe Gln Asp Met Met 305 310 315 320 Cys Val Phe Ile Ala Asn Arg Tyr Asn Val Ser Tyr Trp Ile Glu Glu 325 330 335 Phe Asp Pro Asn Asp Tyr Cys Leu Glu Tyr His Glu Gly Leu Leu Asp 340 345 350 Cys Ser Arg Tyr Thr Ala Val Met Ser Glu Asp Gly Gln Leu Val Arg 355 360 365 Gln Ala Arg Gly Ile Ala Leu Thr Asp Lys Ile Asn Phe Ser Tyr Tyr 370 375 380 Ile Leu Val Thr Leu Arg Val Leu Arg Arg Trp Val Glu Ser Lys Phe 385 390 395 400 Glu Asp Val Glu Gln Thr Glu Phe Ile Arg Trp Glu Asn Arg Met Leu 405 410 415 Tyr Glu His Ile His Leu Leu His Leu Asn 420 425 4 662 PRT Escherichia coli 4 Met His Arg Ala Ser Ala Asn Ser Leu Leu Asn Ser Val Ser Gly Ser 1 5 10 15 Met Met Trp Arg Asn Gln Ser Ser Gly Arg Arg Pro Ser Lys Arg Leu 20 25 30 Ser Asp Asn Glu Ala Thr Leu Ser Thr Ile Asn Ser Ile Leu Gly Ala 35 40 45 Glu Asp Met Leu Ser Lys Asn Leu Leu Ser Tyr Leu Pro Pro Asn Asn 50 55 60 Glu Glu Ile Asp Met Ile Tyr Pro Ser Glu Gln Ile Met Thr Phe Ile 65 70 75 80 Glu Met Leu His Gly His Lys Asn Phe Phe Lys Gly Gln Thr Ile His 85 90 95 Asn Ala Leu Arg Asp Ser Ala Val Leu 
Lys Lys Gln Ile Ala Tyr Gly 100 105 110 Val Ala Gln Ala Leu Leu Asn Ser Val Ser Ile Gln Gln Ile His Asp 115 120 125 Glu Trp Lys Arg His Val Arg Ser Phe Pro Phe His Asn Lys Lys Leu 130 135 140 Ser Phe Gln Asp Tyr Phe Ser Val Trp Ala His Ala Ile Lys Gln Val 145 150 155 160 Ile Leu Gly Asp Ile Ser Asn Ile Ile Asn Phe Ile Leu Gln Ser Ile 165 170 175 Asp Asn Ser His Tyr Asn Arg Tyr Val Asp Trp Ile Cys Thr Val Gly 180 185 190 Ile Val Pro Phe Met Arg Thr Thr Pro Thr Ala Pro Asn Leu Tyr Asn 195 200 205 Leu Leu Gln Gln Val Ser Ser Lys Leu Ile His Asp Ile Val Arg His 210 215 220 Lys Gln Asn Ile Val Thr Pro Ile Leu Leu Gly Leu Ser Ser Val Ile 225 230 235 240 Ile Pro Asp Phe His Asn Ile Lys Ile Phe Arg Asp Arg Asn Ser Glu 245 250 255 Gln Ile Ser Cys Phe Lys Asn Lys Lys Ala Ile Ala Phe Phe Thr Tyr 260 265 270 Ser Thr Pro Tyr Val Ile Arg Asn Arg Leu Met Leu Thr Thr Pro Leu 275 280 285 Ala His Leu Ser Pro Glu Leu Lys Lys His Asn Ser Leu Arg Arg His 290 295 300 Gln Lys Met Cys Gln Leu Leu Asn Thr Phe Pro Ile Lys Val Leu Thr 305 310 315 320 Thr Ala Lys Thr Asp Val Thr Asn Lys Lys Ile Met Asp Met Ile Glu 325 330 335 Lys Glu Glu Lys Asn Ser Asp Ala Lys Lys Ser Leu Ile Lys Phe Leu 340 345 350 Leu Asn Leu Ser Asp Ser Lys Ser Lys Ile Gly Ile Arg Asp Ser Val 355 360 365 Glu Gly Phe Ile Gln Glu Ile Thr Pro Ser Ile Ile Asp Gln Asn Lys 370 375 380 Leu Met Leu Asn Arg Gly Gln Phe Arg Lys Arg Ser Ala Ile Asp Thr 385 390 395 400 Gly Glu Arg Asp Val Arg Asp Leu Phe Lys Lys Gln Ile Ile Lys Cys 405 410 415 Met Glu Glu Gln Ile Gln Thr Gln Met Asp Glu Ile Glu Thr Leu Lys 420 425 430 Thr Thr Asn Gln Met Phe Glu Arg Lys Ile Lys Asp Leu His Ser Leu 435 440 445 Leu Glu Thr Asn Asn Asp Cys Asp Arg Tyr Asn Pro Asp Leu Asp His 450 455 460 Asp Leu Glu Asn Leu Ser Leu Ser Arg Ala Leu Asn Ile Val Gln Arg 465 470 475 480 Leu Pro Phe Thr Ser Val Ser Ile Asp Asp Thr Arg Ser Val Ala Asn 485 490 495 Ser Phe Phe Ser Gln Tyr Ile Pro Asp Thr Gln Tyr Ala Asp Lys Arg 500 505 510 Ile Asp Gln Leu Trp Glu Met Glu Tyr Met 
Arg Thr Phe Arg Leu Arg 515 520 525 Lys Asn Val Asn Asn Gln Gly Gln Glu Glu Ser Ile Thr Tyr Ser Asn 530 535 540 Tyr Ser Ile Glu Leu Leu Ile Val Pro Phe Leu Arg Arg Leu Leu Asn 545 550 555 560 Ile Tyr Asn Leu Glu Ser Ile Pro Glu Glu Phe Leu Phe Leu Ser Leu 565 570 575 Gly Glu Ile Leu Leu Ala Ile Tyr Glu Ser Ser Lys Ile Lys His Tyr 580 585 590 Leu Arg Leu Val Tyr Val Arg Glu Leu Asn Gln Ile Ser Glu Val Tyr 595 600 605 Asn Leu Thr Gln Thr His Pro Glu Asn Asn Glu Pro Ile Phe Asp Ser 610 615 620 Asn Ile Phe Ser Pro Asn Pro Glu Asn Glu Ile Leu Glu Lys Ile Lys 625 630 635 640 Arg Ile Arg Asn Leu Arg Arg Ile Gln His Leu Thr Arg Pro Asn Tyr 645 650 655 Pro Lys Gly Asp Gln Asp 660 

1. Apparatus for automatic segmentation of non-aligned data sequences comprising structural domains to identify the structural domains and construct models thereof, the apparatus comprising: a soft clustering unit for: iteratively partitioning said data sequences and training a plurality of variable memory Markov sources thereon to reach a state of convergence, and eliminating ones of said variable memory Markov sources showing low relationships with the data, a refinement unit associated with said soft clustering unit for splitting and perturbing said sources, following convergence, for further iterative partitioning and eliminating at said soft clustering unit, and an annealing unit, associated with said soft clustering unit, for successively increasing a resolution with which said relationships between data and sources is shown, thereby to render said eliminating a progressive process, said apparatus being operable to output remaining variable memory Markov sources to provide models for subsequent identification of said structural domains.
 2. The apparatus of claim 1, wherein said sequences are biological sequences.
 3. The apparatus of claim 2, wherein said sequences are protein sequences.
 4. The apparatus of claim 3, wherein said structural domains are functional protein units.
 5. The apparatus of claim 1, wherein said sources comprise prediction suffix trees.
 6. The apparatus of claim 4, wherein said structural domains are from domain families being any one of a group comprising Pax proteins, type II DNA Topoisomerases, and glutathione S-transferases.
 7. Method for automatic segmentation of non-aligned data sequences comprising structural domains to identify the structural domains and construct models thereof, the method comprising: iteratively partitioning said data sequences and training a plurality of variable memory Markov sources thereon to reach a state of convergence, and eliminating ones of said variable memory Markov sources showing low relationships with the data, splitting and perturbing said sources, following convergence, for further iterative partitioning and eliminating, and successively increasing a resolution with which said relationships between data and sources is shown, thereby to render said further eliminating a progressive process, outputting remaining variable memory Markov sources to provide models for subsequent identification of said structural domains. 